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Abstract We study the performance of first- and second-order optimization 
methods for £i-regularized sparse least-squares problems as the conditioning 
of the problem changes and the dimensions of the problem increase up to one 
trillion. A rigorously defined generator is presented which allows control of the 
dimensions, the conditioning and the sparsity of the problem. The generator 
has very low memory requirements and scales well with the dimensions of the 
problem. 
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1 Introduction 

We consider the problem 

minimize fr{x) := r||x||i -f ]^\\Ax - h\\l, (1) 

where x S K”, || • ||i denotes the .^i-norm, || • II 2 denotes the Euclidean norm, 
T > 0, A G and b € K™. An application that is formulated as in Q is 
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sparse data fitting, where the aim is to approximate n-dimensional sampled 
points (rows of matrix A) using a linear function, which depends on less than 
n variables, i.e., its slope is a sparse vector. Let us assume that we sample m 
data points ( 0 ^, 6 ^), where G M" and G K Vz = 1, 2, • • • , m. We assume 
linear dependence of bi on af. 

bi = ajx + ei Vz = 1, 2, • • • , m, 

where is an error term due to the sampling process being innacurate. De¬ 
pending on the application some statistical information is assumed about vec¬ 
tor e. In matrix form the previous relationship is: 

b = Ax + e, ( 2 ) 


where A G is a matrix with a^’s as its rows and b G K™ is a vector with 

6 i’s as its components. The goal is to find a sparse vector x (with many zero 
components) such that the error \\Ax — b \\2 is minimized. To find x one can 
solve problem 0. The purpose of the £1 norm in 0 is to promote sparsity in 
the optimal solution [T]. An example that demonstrates the purpose of the £1 
norm is presented in Figure[^ Figure[^shows a two dimensional instance where 
n = 2, m = 1000 and matrix A is full-rank. Notice that the data points Ui 
Vz = 1, 2, • • • , rrz have large variations with respect to feature [ 0^(1 Vz, where [-Jj 
is the jth component of the input vector, while there is only a small variation 
with respect to feature [ 0^)2 Vz. This property is captured when problem Q 
is solved with r = 30. The fitted plane in Figure la depends only on the first 
feature [a]i, while the second feature [ 0(2 is ignored because [x *\2 = 0, where 
X* is the optimal solution of Q. This can be observed through the level sets 
of the plane shown with the colored map; for each value of [a]i the level sets 
remain constant for all values of [a] 2 - On the contrary, this is not the case 
when one solves a simple least squares problem (r = 0 in Q). Observe in 
Figure la that the fitted plane depends on both features [a]i and [ 0 ( 2 . 

A variety of sparse data fitting applications originate from the fields of sig¬ 
nal processing and statistics. Five representative examples are briefly described 
below. 


- Magnetic Resonance Imaging (MRI): A medical imaging tool used to scan 
the anatomy and the physiology of a body m- 

- Image inpainting: A technique for reconstructing degraded parts of an im¬ 
age [7]. 

- Image deblurring: Image processing tool for removing the blurriness of a 
photo caused by natural phenomena, such as motion m- 

- Genome-Wide Association study (GWA): DNA comparison between two 
groups of people (with/without a disease) in order to investigate factors 
that a disease depends on [42] . 

- Estimation of global temperature based on historic data |^ . 

Data fitting problems frequently require the analysis of large scale data 
sets, i.e., gigabytes or terabytes of data. In order to address large scale problems 
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(a) regularized (b) I 2 regularized 

Fig. 1: Demonstration of the purpose of the norm for data fitting problems. 


there has been a resurgence in methods with computationally inexpensive iter¬ 
ations. For example many first-order methods were recovered and refined, such 
as coordinate descent alternating direction method 

of multipliers [iiiaiiHiiisiis], proximal first-order methods and first- 

order smoothing methods [S1IS1I30]. The previous are just few representative 
examples, the list is too long for a complete demonstration, many other ex¬ 
amples can be found in mu- Often the goal of modern first-order methods 
is to reduce the computational complexity per iteration, while preserving the 
theoretical worst case iteration complexity of classic first-order methods |29) . 
Many modern first order methods meet the previous goal. For instance, coor¬ 
dinate descent methods can have up to n times less computational complexity 
per iteration [351IM] . 

First-order methods have been very successful in various scientific fields, 
such as support vector machine [46] , compressed sensing El, image processing 
[T^ and data fitting (35] . Several new first-order type approaches have recently 
been proposed for various imaging problems in the special issue edited by M. 
Bertero, V. Ruggiero and L. Zanni [8]. However, even for the simple uncon¬ 
strained problems that arise in the previous fields there exist more challeng¬ 
ing instances. Since first-order methods do not capture sufficient second-order 
information, their performance might degrade unless the problems are well 
conditioned m- On the other hand, the second-order methods capture the 
curvature of the objective function sufficiently well, but by consensus they are 
usually applied only on medium scale problems or when high precision accu¬ 
racy is required. In particular, it is frequently claimed umanniEH] that the 
second-order methods do not scale favourably as the dimensions of the problem 
increase because of their high computational complexity per iteration. Such 
claims are based on an assumption that a full second-order information has 
to be used. However, there is evidence [iBlfT^ that for non-trivial problems, 
inexact second-order methods can be very efficient. 
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In this paper we will exhaustively study the performance of Hrst- and 
second-order methods. We will perform numerical experiments for large-scale 
problems with sizes up to one trillion of variables. We will examine conditions 
under which certain methods are favoured or not. We hope that by the end 
of this paper the reader will have a clear view about the performance of first- 
and second-order methods. 

Another contribution of the paper is the development of a rigorously de¬ 
fined instance generator for problems of the form of Q. The most important 
feature of the generator is that it scales well with the size of the problem and 
can inexpensively create instances where the user controls the sparsity and 
the conditioning of the problem. For example see Subsection |8.9[ where an 
instance of one trillion variables is created using the proposed generator. We 
believe that the flexibility of the proposed generator will cover the need for 
generation of various good test problems. 

This paper is organised as follows. In Sectionwe briefly discuss the struc¬ 
ture of first- and second-order methods. In Sectionj^we give the details of the 
instance generator. In Section]^ we provide examples for constructing matrix 
A. In Section]^ we present some measures of the conditioning of problem Q. 
These measures will be used to examine the performance of the methods in the 
numerical experiments. In Sectionj^we discuss how the optimal solution of the 
problem is selected. In Sectionj^we briefly describe known problem generators 
and explain how our propositions add value to the existing approaches. In Sec¬ 
tion]^ we present the practical performance of first- and second-order methods 
as the conditioning and the size of the problems vary. Finally, in Section]^ we 
give our conclusions. 


2 Brief discussion on first- and second-order methods 

We are concerned with the performance of unconstrained optimization meth¬ 
ods which have the following intuitive setting. At every iteration a convex 
function Qt{v]x) : M" —>• K is created that locally approximates Jt at a given 
point X. Then, function Qr is minimized to obtain the next point. An ex¬ 
ample that covers the previous setting is the Generic Algorithmic Framework 
(GFrame) which is given below. Details of GFrame for each method used in 
this paper are presented in Section 

Loosely speaking, close to the optimal solution of problem Q, the better 
the approximation Q^. of Jt at any point x the fewer iterations are required 
to solve 0. On the other hand, the practical performance of such methods is 
a trade-off between careful incorporation of the curvature of /r, i.e. second- 
order derivative information in Qr and the cost of solving subproblem (|^ in 
GFrame. 

Discussion on two examples of Qr which consider different trade-off follows. 
First, let us fix the structure of Qr for problem Q to be 

Qriy;x) := T\\y\\i + ^\\Ax-b\\‘^ + {A'^{Ax-b)y{y-x) + ^{y-xyH{y-x), (4) 
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1: Initialize xq G and G 

For fc = 0,1, 2,... until some termination criteria are satisfied 
2: Create a convex function Qriy, Vk) that approximates fr in a neighbourhood of 

3: Approximately (or exactly) solve the subproblem 

Xk+I ~ argmin Q{y,yk) (3) 

y 

4: Find a step-size a > 0 based on some criteria and set 


Vk + l — ^k ^iS^k + l ^fc) 


end-for 

5: Return approximate solution x/c+i 

Algorithm : Generic Framework (GFrame) 


where H G is a positive definite matrix. Notice that the decision of 

creating Q^- has been reduced to a decision of selecting JI. Ideally, matrix H 
should be chosen such that it represents curvature information of 1/2|| Aa; — 
at point X, i.e. matrix H should have similar spectral decomposition to A. 
Let B{x) := {u S M" | ||w — x \\2 < 1} be a unit ball centered at x. Then, H 
should be selected in an optimal way: 


min 

Hyo 


Ifriy) - Qr{y,x)\dB. 


(5) 


The previous problem simply states that H should minimize the sum of the 
absolute values of the residual fr — Qr over B. Using twice the fundamental 
theorem of calculus on fr from a; to ?/ we have that <§ is equivalent to 


min 

Hyo 


1 

2 


{y-xy{A^A-H){y 


x) 


dB. 


( 6 ) 


It is trivial to see that the best possible H is simply H = A'^A. However, 
this makes every subproblem (|^ as difficult to be minimized as the original 
problem Q. One has to reevaluate the trade-off between a matrix H that 
sufficiently well represents curvature information of 1/2||Aa: — 6||^ at a point x 
compared to a simple matrix H that is not as good approximation but offers 
an inexpensive solution of subproblem (|^. An example can be obtained by 
setting iL to be a positively scaled identity, which gives a solution to problem 
^ H = Xmax{A'^A)In, where \max{-) denotes the largest eigenvalue of the 
input matrix and /„ is the identity matrix of size n x n. The contours of such 
a function Qr compared to those of function fr are presented in Subfigure 
Notice that the curvature information of function fr is lost, this is because 
nearly all spectral properties of A'''A are lost. However, for such a function 
Qr the subproblem ([^ has an inexpensive closed form solution known as 
iterative shrinkage-thresholding [niET]. The computational complexity per 
iteration is so low that one hopes that this will compensate for the losses of 
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(a) Separable quadratic (b) Non separable quadratic 

Fig. 2: Demonstration of the contours of two different types of function 
which locally approximate function fr at point x. In the left subfigure, function 
Qr is a simple separable quadratic for which, H = A)In in Q, that 

is frequently used in first-order methods. In the right subfigure, function Q^. 
is a non separable quadratic © which is used in some of the second-order 
methods. 


curvature information. Such methods, are called first-order methods and have 
been shown to be efficient for some large scale problems of the form of 0 m- 
Another approach of constructing Qr involves the approximation of the 
fi-norm with the pseudo-Huber function 

n 

+ Xi)^ - (7) 

i=l 

where /i > 0 is an approximation parameter. This approach is frequently 
used by methods that aim in using at every iteration full information from 
the Hessian matrix see for example p]6lll9) . Using Q, problem Q is 

replaced with 

minimize /^(x) := Tip^{x) + - 6|p. (8) 

The smaller fj, is the better the approximation of problem (|^ to 0. The 
advantage is that /.^ in Q is a smooth function which has derivatives of all 
degrees. Hence, smoothing will allow access to second-order information and 
essential curvature information will be exploited. However, for very small fj, 
certain problems arise for optimization methods of the form of GFrame, see 
m- For example, the optimal solution of 0 is expected to have many zero 
components, on the other hand, the optimal solution of 0 is expected to have 
many nearly zero components. However, for small /r one can expect to obtain 
a good approximation of the optimal solution of 0 by solving 0. For the 
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smooth problem ([^, the convex approximation Qt at x is: 

Qr{y;x) := f^{x) + Vf^ixYiy -x) + ^{y- xyV^f^{x){y - x). (9) 

The contours of such a function Q^. compared to function f^- are presented in 
Subfigure |2b[ Notice that Qr captures the curvature information of function 
Jt . However, minimizing the subproblem ^ might be a more expensive opera¬ 
tion. Therefore, we rely on an approximate solution of ([^ using some iterative 
method which requires only simple matrix-vector product operations with ma¬ 
trices A and A'^. In other words we use only an approximate second-order in¬ 
formation. It is frequently claimed nisiiniiiniiss] that second-order methods do 
not scale favourably with the dimensions of the problem because of the more 
costly task of solving approximately the subproblems in ([^ , instead of having 
an inexpensive closed form solution. Such claims are based on an assumption 
that full second-order information has to be used when solving subproblem 
(§. Clearly, this is not necessary: an approximate second-order information 
suffices. Studies in nans] provided theoretical background as well as the pre¬ 
liminary computational results to illustrate the issue. In this paper, we provide 
rich computational evidence which demonstrates that second-order methods 
can be very efficient. 


3 Instance Generator 

In this section we discuss an instance generator for Q for the cases m > n 
and m < n. The generator is inspired by the one presented in Section 6 of 
[32] . The advantage of our modified version is that it allows to control the 
properties of matrix A and the optimal solution x* of 0. For example, the 
sparsity of matrix A, its spectral decomposition, the sparsity and the norm of 
a;*, since A and x* are defined by the user. 

Throughout the paper we will denote the component of a vector, by 
the name of the vector with subscript i. Whilst, the column of a matrix is 
denoted by the name of the matrix with subscript i. 


3.1 Instance Generator for m > n 


Given r > 0, H S and x* G K" the generator returns a vector b G K"* 

such that X* := arg min„, frix). For simplicity we assume that the given matrix 
A has rank n. The generator is described in Procedure IGen below. 

In procedure IGen, given t, A and x* we are aiming in finding a vector h 
such that X* satisfies the optimality conditions of problem 0 

{Ax* —b)G —r9||a;*||i. 


where 3||a:||i = [—1,1]" is the subdifferential of the £i-norm at point x. By 
fixing a subradient g G i9||a;*||i as defined in (lOl and setting e = b — Ax*, the 
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1: Initialize r > 0, A G with m > n and rank n, x* G 

2: Construct g G M”" such that g G 5||3^*||i: 

('{1}, ifx*>0 

Si e < {-1}, if x* < 0 Vi = l,2,---,n (10) 

[[-1,1], ifa:* = 0 


3: Set e = tA{A'^A) 

4: Return b = Ax* + e 

Procedure : Instance Generator (IGen) 


previous optimality conditions can be written as 


A'^e = Tg. (11) 

The solution to the underdetermined system ([TT]) is set to e = tA(A'^A) 
and then we simply obtain b = Ax* + e; Steps 3 and 4 in IGen, respectively. 
Notice that for a general matrix A, Step 3 of IGen can be very expensive. 
Fortunately, using elementary linear transformations, such as Givens rotations, 
we can iteratively construct a sparse matrix A with a known singular value 
decomposition and guarantee that the inversion of matrix A'^ A in Step 3 of 
IGen is trivial. We provide a more detailed argument in Section]^ 


3.2 Instance Generator for m < n 


In this subsection we extend the instance generator that was proposed in 
Subsection |3.1| to the case of matrix A G with more columns than 

rows, i.e. m < n. Given t > 0, B G ]\f g ^ 

the generator returns a vector b G K"* and a matrix A G such that 

X* := argmin„, /r(a:). 

For this generator we need to discuss first some restrictions on matrix A 
and the optimal solution x*. Let 


S:={iG{ 1,2,--- ,n}lx*^0} (12) 


with [S'! = s and A 5 G be a collection of columns from matrix A which 

correspond to indices in S. Matrix must have rank s otherwise problem 
0 is not well-defined. To see this, let sign(a;g) G K® be the sign function 
applied component-wise to Xg, where Xg is a vector with components of x* 
that correspond to indices in S. Then problem ([^ reduces to the following: 

minimize rsign(a: 5 )‘'’a;s -I- i||Asa;s - b\\l, (13) 


where Xs G M®. The first-order stationary points of problem (13) satisfy 


A^gAsXs = —rsign(a: 5 ) -I- ^gb. 
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If rank(As) < s, the previous linear system does not have a unique solution 
and problem 0 does not have a unique minimizer. Having this restriction in 
mind, let us now present the instance generator for m < n in Procedure IGen2 
below. 


1: Initialize t > 0, B £ jjmxm rank m, N £ g jjn with S := 

{1, 2, ■ ■ ■ , s} and s < m 

2: Construct g £ R"* such that g £ d\\x* {1,2, - ■ ■ , m)\\ i: 


( {1}, if < > 0 

Si e < {-!}, if I* < 0 Vi = l,2,---,m 
[[-1,1], if< = 0 


(14) 


3: Set e = tB~'^ g 

4: Construct matrix N E with the following loop: 

For k = 1,2, — m 

No = —=— No, where £ is a random variable in [—1, ll 
|A^7e| ^ ^ J 

end-for 

5: Return A = [B, iV] and h = Ax* + e 

Procedure : Instance Generator 2 (IGen2) 


In IGen2, given r, H, N and x* we are aiming in finding a vector b and a 
matrix N such that for A = [B,N], x* satisfies the optimality conditions of 
problem Q 

{Ax* —b)€ —r9||a;*||i. 

Without loss of generality it is assumed that all nonzero components of x* 
correspond to indices in S' = {1, 2, • • • , s}. By fixing a partial subradient g G 
S||x*(l, 2, • • • , to) 111 as in (14), where a;*{l, 2, • • • , to) G K"* is a vector which 
consists of the first to components of x*, and defining a vector e = b — Ax*, 
the previous optimality conditions can be written as: 

e = TB~'^g and fV^e G r[—1,1]””™. 


(15) 


It is easy to check that by defining N as in Step 4 of IGen2 conditions (15) 
are satisfied. Finally, we obtain b = Ax* + e. 

Similarly to IGen in Subsection (3.1), for Step 3 in IGen2 we have to 


perform a matrix inversion, which generally can be an expensive operation. 
However, in the next section we discuss techniques how this matrix inversion 
can be executed using a sequence of elementary orthogonal transformations. 


4 Construction of matrix A 

In this subsection we provide a paradigm on how matrix A can be inexpen¬ 
sively constructed such that its singular value decomposition is known and its 







10 


Kimon Fountoulakis, Jacek Gondzio 


sparsity is controlled. We examine the case of instance generator IGen where 
m > n. The paradigm can be easily extended to the case of IGen2, where 
m < n. 

Let S e be a rectangular matrix with the singular values CTi, CT 2, • • • , cr„ 

on its diagonal and zeros elsewhere: 

v._ |’diag(CTi,(T2 ,---,CT„) 

O ’ 

'^m—nxn 


where Om-nxn G is a matrix of zeros, and let G{i,j,9) G be a 

Givens rotation matrix, which rotates plane i-j by an angle 6: 




T • • • 0 • • • 0 • • • O' 

0 • • • c • • • —s • • • 0 

c •••0 

0 •••1 


where i,j € {1, 2, • • • ,n}, c = cos 9 and s = sin 9. Given a sequence of Givens 
rotations {G(ik, Jk, we define the following composition of them: 

G = G{ii,ji,9i)G{i2,j2,92) ■ ■ ■ G{iK,jK, 9 k)- 

Similarly, let G{l,p^'d) G be a Givens rotation matrix where l^p G 

{1, 2, • • • , m} and 

G = G(li,Pi,-9i)G(l2,P2,'92) ■ ■■G(Ik,Pk,i^k) 


be a composition of K Givens rotations. Using G and G we define matrix A 


as 


A = (PiGP2)SG\ 


(16) 


where Pi, P 2 G 


are permutation matrices. Since the matrices P 1 GP 2 


and G are orthonormal it is clear that the left singular vectors of matrix A 
are the columns of P 1 GP 2 , S is the matrix of singular values and the right 
singular vectors are the columns of G. Hence, in Step 3 of IGen we simply set 
{A^A)~^ = G{S'^E)~^G'^, which means that Step 3 in IGen costs two matrix- 
vector products with G and a diagonal scaling with S)~^. Moreover, the 
sparsity of matrix A is controlled by the numbers K and K of Givens rotations, 
the type, i.e. {i,j, 9) and (l,p, -&), and the order of Givens rotations. Also, notice 
that the sparsity of matrix A'''A is controlled only by matrix G. Examples are 


given in Subsection 4.1 


It is important to mention that other settings of matrix A in (161 could be 


used, for example different combinations of permutation matrices and Givens 


rotations. The setting chosen in (16) is flexible, it allows for an inexpensive 
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construction of matrix A and makes the control of the singular value decom¬ 
position and the sparsity of matrices A and A'''A easy. 

Notice that matrix A does not have to be calculated and stored. In par¬ 
ticular, in case that the method which is applied to solve problem Q re¬ 
quires only matrix-vector product operations using matrices A and A'^, one 
can simply consider matrix A as an operator. It is only required to predefine 
the triplets {ik,jk,()k) Vfc = I,2,--- ,K for matrix G, the triplets (lk,Pk,Qk) 
yk = 1, 2, • • • , K for matrix G and the permutation matrices Pi and ^ 2 - The 
previous implies that the generator is inexpensive in terms of memory require¬ 
ments. Examples of matrix-vector product operations with matrices A and A'^ 
in case of (16) are given below in Algorithms MvPA and MvPAt, respectively. 


1: Given a matrix A defined as in | |16^ and an input vector x £ R", do 
2: Set yo = X 

For k = 1,2,..., K 
3. Vk — CrJ^yk—i 

end-for 

4: Set yo = P 2 SyK __ 

5: For k = 1,2,..., K 
6 . Vk — — + —1 

end-for 

7: Return Piyj^ 

Algorithm : Matrix-vector product with A (MvPA) 


1: Given a matrix A defined as in ( |16| > and input vector y £ do 
2: Set xq = P^y 
For k = 1, 2,..., iC 
3. —1 

end-for 

4: Set XQ = E'^P^x^ 

For k = 1, 2,..., iC 
5: Xf. = Gx-k+i^k-i 

end-for 
6: Return xk 

Algorithm : Matrix-vector product with (MvPAt) 


4.1 An example using Givens rotation 

Let us assume that m, n are divisible by two and m > n. Given the singular 
values matrix E and rotation angles 9 and f?, we construct matrix A as 


A = (PGP)EG'^ 
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where P is a random permutation of the identity matrix, G is a composition 
of n/2 Givens rotations: 

with 

Ti 

ik = 2k-l, jk = 2k for A: = 1, 2,3, • • • ,- 
and G is a composition oi mj2 Givens rotations: 

G = Gili,Pi,'d)G{l2,P2,'&) ■ ■ ■ ,G{lk,Pk,'l}), - ■ ■ ,G{ljn/2, Pm/2, 1^) 

with 

Tfl 

lk = 2k-l, pk = 2k for fc = 1, 2, 3, • • • , y. 

Notice that the angle 0 is the same for all Givens rotations Gk, this means that 
the total memory requirement for matrix G is low. In particular, it consists only 
of the storage of a 2 x 2 rotation matrix. Similarly, the memory requirement 
for matrix G is also low. 


4.2 Gontrol of sparsity of matrix A and A'^A 


We now present examples in which we demonstrate how sparsity of matrix A 
can be controlled through Givens rotations. 


In the example of Subsection 4.1 two compositions of n/2 and m/2 Givens 
rotations, denoted by G and G, are applied on an initial diagonal rectangular 
matrix 27. If n = 2^ and m = 2n the spa rsity pattern of the resulting ma¬ 


trix A = {PGP)EG'^ is given in Subfigure 


3a 


and has 28 nonzero elements. 


while the sparsity pattern of matrix A is given in Subfigure |4a| and has 16 
nonzero elements. Notice in this subfigure that the coordinates can be clus¬ 
tered in pairs of coordinates (1,2), (3,4), (5,6) and (7,8). One could apply 
another stage of Givens rotations. For example, one could construct matrix 
A = (PGG2P)r(G2G)T, where 


G 2 — G{ii,ji,0)G{i2,j2,0) • ■ ■ ,G{ik,jk,0), ■ ■ ■ jG'(i„/2_i,j„/2-i,0) 


with 


and 


ik = 2k, jk = 2k + l for fc = 1, 2, 3, • • • , - - 1. 


G 2 — G{li,pi,9)G{l2,P2,0) ■ • • ,G{lk,Pk,0),‘ ■ ■ , G(lm/2-l,Pm/2-l, 0) 

with 

TTl 

lk=2k, pk = 2k + l for fc = 1, 2, 3, • • • , y - 1. 

Matrix A = {PGG 2 P)S{G 2 Gy has 74 nonzeros and it is shown in Subfig¬ 
ure |3b[ while matrix A'^ A has 38 nonzeros and it is shown in Subfigure |4b| 





Title Suppressed Due to Excessive Length 


13 


By rotating again we obtain the matrix A = {PGG 2 GP)S{GG 2 Gy in Sub¬ 
figure with 104 nonzero elements and matrix A'^ A in Subfigure |4c| with 
56 nonzero elements. Finally, the fourth Subfigures |3d| and |4d| show matrix 
A = {PG 2 GG 2 GP)E{G 2 GG 2 Gy and A'^ A with 122 and 62 nonzero elements, 
respectively. 


5 Conditioning of the problem 


Let us now precisely define how we measure the conditioning of problem Q. 
For simplicity, throughout this section we assume that matrix A has more rows 
than columns, m > n, and it is full-rank. Extension to the case of matrix A 
with more columns than rows is easy and we briefly discuss this at the end of 
this section. 

We denote with span(-) the span of the columns of the input matrix. More¬ 
over, S is defined in (12), 5"^ is its complement. 

Two factors are considered that affect the conditioning of the problem. 
First, the usual condition number of the second-order derivative of l/2||Aa; — 
b\\l in (0, which is simply k{A'^A) = Xi{A'^A)/\n{A'^A), where 0 < A„ < 
A„_i < • • • < Ai are the eigenvalues of matrix A'''A. It is well-known that the 
larger n{A'^A) is, the more difficult problem 0 becomes. 

Second, the conditioning of the optimal solution x* of problem ([^. Let us 
explain what we mean by the conditioning of x*. We define a constant p > 0 
and the index set Ip := {i S {1,2, • • • , n} | Xi[A^ A) > p}. Furthermore, we 


r = \lp\ and matrix Gp 


define the projection Pp = GpGJ, where Gp G 
has as columns the eigenvectors of matrix A which correspond to eigenvalues 
with indices in Ip. Then, the conditioning of x* is defined as 


Kp{x*) 


+ 00 , 


if PpX* yf 0 
otherwise 


(17) 


For the case PpX* yf 0, the denominator of 0 is the mass of x* which exists 
in the space spanned by eigenvectors of A'''A which correspond to eigenvalues 
that are larger than or equal to p. 

Let us assume that there exists some p which satisfies Xn{A'^ A) < p 
Ai(A'''A). If Kp{x*) is large, i.e., ||Ppa::*||2 is close to zero, then the majority 
of the mass of x* is “hidden” in the space spanned by eigenvectors which 
correspond to eigenvalues that are smaller than p, i.e., the orthogonal space 
of span(Gp). In Section we referred to methods that do not incorporate 
information which correspond to small eigenvalues of A'^ A. Therefore, if the 
previous scenario holds, then we expect the performance of such methods to 
degrade. In Section we empirically verify the previous arguments. 

If matrix A has more columns than rows then the previous definitions of 
conditioning of problem (Q are incorrect and need to be adjusted. Indeed, if 
m < n and rank(A) = min(m,n) = m, then A'^A is a rank deficient matrix 
which has m nonzero eigenvalues and n — m zero eigenvalues. However, we can 





14 


Kimon Fountoulakis, Jacek Gondzio 


0|-^^^-r 

• • 

2 - • • • • 

• • 

4 - • • 

6 - 

• • • • 

ms- 

10 - . . . . 

12 - • • 

• • 

14 - • • • • 

16 - • • 

0 2 4 6 8 

n 

(a) A = (PGP)SGT 


0|-1-^-1-r 

• • • 

2 . 

• • • • • 

4 - • • • • • 

• • • 

6 - • • • • 

• • • • • 

••• 

• • • 

10 - • • • • • 

• • • • 

12 - • • • • • 

• • • • • 

14- •••••• 

• • • • • • 

16 - • • • 

0 2 4 6 8 

n 
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(c) A = {PGG 2 GP)E{GG 2 Gy 
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(d) A = (PG2GG2GP)E(G2GG2Gy 



Fig. 3: Sparsity pattern of four examples of matrix A, the Givens rotations G 
and G 2 are explained in Subsections 4.1 and 


4.2 












































Title Suppressed Due to Excessive Length 


15 

























1 


• 

• 











• 

• 

• 







2 


• 

• 









2 


• 

• 

• 

• 

• 





3 




• 

• 







3 


• 

• 

• 

• 

• 





4 




, 








4 



, 

, 


, 


, 



n 












n 











5 






• 

• 





5 



• 

• 

• 

• 

• 

• 



6 






• 

• 





6 





• 

• 

• 

• 

• 


7 








• 

• 



7 





• 

• 

• 

• 

• 


8 








• 

• 



8 







• 

■ 

• 


g 












9 

































n n 


(a) AT A, A = (PGP)EGT 


(b) AT A, A = (PG2GP)E(G2G)t 
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(c) AT A, A = (PGG 2 GP)E(GG 2 G)t (d) At A, A = (PG 2 GG 2 GP)E(G 2 GG 2 G)t 


Fig. 4: Sparsity pattern of four examples of matrix A'''A, where the Givens 
rotations G and G2 are explained in Subsections 


4.1 


and 


4.2 


restrict the conditioning of the problem to a neighbourhood of the optimal 
solution of X*. In particular, let us define a neighbourhood of x* so that all 
points in this neighbourhood have nonzeros at the same indices as x* and zeros 
elsewhere, i.e. Af := {a; € M” | Xi 7^ 0 Vi G 5, Xi = 0 Vi € In this case 
an important feature to determine the conditioning of the problem is the ratio 
of the largest and the smallest nonzero eigenvalues of AjAs, where A5 is a 
submatrix of A built of columns of A which belong to set S. 
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6 Construction of the optimal solution 

Two different techniques are employed to generate the optimal solution x* for 
the experiments presented in Section The first procedure suggests a simple 
random generation of a;*, see Procedure OsGen below. 


1: Given the required number s < min(m, n) of nonzeros in fc* and a positive constant 
7 > 0 do: 

2: Choose a subset S C { 1 , 2, • • • , n} with \S\ = s. 

3: Vi E 5" choose x* uniformly at random in [— 7 , 7 ] and Vj ^ S set Xj = 0. 

Procedure : Optimal solution Generator (OsGen) 


The second and more complicated approach is given in Procedure OsGen2. 
This procedure is applicable only in the case that m> n, however, it can be 
easily extended to the case oi m < n. We focus in the former scenario since 
all experiments in Section are generated by setting m > n. 

1: Given the required number s < min(m, n) of nonzeros in x*, a positive constant 7 > 0, 
the right singular vectors G and singular values E of matrix A do: 

2: Solve approximately 


X* ■.= argmin WC^x — E)~^ln\\'^ 

(18) 

subject to: ||fc||o < s, 

where In E is a vector of ones and || • ||o is the zer o no rm which returns the number 
of nonzero components of the input vector. Problem l |18[ l can be solved approximately 
using an Orthogonal Matching Pursuit (OMP) [28] solver implemented in [3]. 


Procedure : Optimal solution Generator 2 (OsGen2) 


The aim of Procedure OsGen2 is to find a sparse x* with Kp{x*) arbitrar¬ 
ily large for some p in the interval Xn{A'^A) < p <C Ai(A'''A). In particular, 
OsGen2 will return a sparse x* which can be expressed as x* = Gv. The coef¬ 
ficients V are close to the inverse of the eigenvalues of matrix A'^A. Intuitively, 
this technique will create an x* which has strong dependence on subspaces 
which correspond to small eigenvalues of A'^A. The constant 7 is used in order 
to control the norm of x*. 

The sparsity constraint in problem (18), i.e., |7||o < s, makes the approx¬ 
imate solution of this problem difficult when we use OMP, especially in the 
case that s and n are large. To avoid this expensive task we can ignore the 
sparsity constraint in (18). Then we can solve exactly and inexpensively the 
unconstrained problem and finally we can project the obtained solution in the 
feasible set defined by the sparsity constraint. Obviously, there is no guaran¬ 
tee that the projected solution is a good approximation to the one obtained 
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in Step 2 of Procedure OsGen2. However, for all experiments in Section 
that we applied this modification we obtained sufficiently large Kp{x*). This 
means that our objective to produce ill-conditioned optimal solutions was met, 
while we kept the computational costs low. The modified version of Procedure 
OsGen2 is given in Procedure OsGen3. 

1: Given the required number s < mm{m,n) of nonzeros in x*, two non negative integers 
Si and S 2 such that si + S 2 = s, a positive constant 7 > 0, the right singular vectors G 
and singular values E of matrix A do: 

2: Solve exactly 

X* := argmin x —E)~^ln\\'^ /'iqn 

where G is a vector of ones. Problem ( |19| l can be solved exactly and inexpensively 
because C" is an orthonormal matrix. 

3: Maintain the positions and the values of the si smallest and S 2 largest (in absolute 
values) components of x*. 

4: Set the remaining components of x* to zero. 


Procedure : Optimal solution Generator 3 (OsGenS) 


7 Existing Problem Generators 


So far in Section 3.1 we have described in details our proposed problem gen¬ 
erator. Moreover, in Section we have described how to construct matrices 
A such that the proposed generator is scalable with respect to the number of 
unknown variables. We now briefly describe existing problem generators and 
explain how our propositions add value to the existing approaches. 

Given a regularization parameter r existing problem generators are looking 
for A, b and x* such that the optimality conditions of problem Q : 


A^Ax* -b) G -TC 


( 20 ) 


are satisfied. For example, in |32] the author fixes a vector of noise e and 
an optimal solution x* and then finds A and b such that (20) is satisfied. In 
particular, in |32] matrix A = BD is used, where H is a fixed matrix and D is 
a scaling matrix such that the following holds. 

{BOye G T5||a;*||i, 

Matrix D is trivial to calculate, see Section 6 in [52] for details. Then by 


setting b = Ax* + e (20) is satisfied. The advantage of this generator is that it 


alfows control of the noise vector e, in comparison to our approach where the 
vector noise e has to be determined by solving a linear system. On the other 
hand, one does not have direct control over the singular value decomposition 
of matrix A, since this depends on matrix D, which is determined based on 
the fixed vectors e and x*. 

Another representative example is proposed in [26j . This generator, which 
we discovered during the revision of our paper, proposes the same setting as 
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in our paper. In particular, given A, x* and t one can construct a vector h 
(or a noise vector e) such that (20) is satisfied. However, in [26] the author 
suggests that b can be found using a simple iterative procedure. Depending on 
matrix A and how ill-conditioned it is, this procedure might be slow. In this 
paper, we suggest that one can rely on numerical linear algebra tools, such 
as Givens rotation, in order to inexpensively construct b (or a noise vector e) 
using straightforwardly scalable operations. Additionally, we show in Section 
([^ that a simple construction of matrix A is sufficient to extensively test the 
performance of methods. 


8 Numerical Experiments 

In this section we study the performance of state-of-the-art first- and second- 
order methods as the conditioning and the dimensions of the problem in¬ 
crease. The scripts that reproduce the experiments in this section as well as 
the problem generators that are described in Section can be downloaded 
from: http://www.maths.ed.ac.uk/ERGD/trillion/ 


8.1 State-of-the-art Methods 


HHnidKIlllSlIMlIST] methods have been developed for the solution of problem 
([^. In this section we examine the performance of the following state-of-the- 
art methods. Notice that the first three methods FISTA, PSSgb and PCDM 
do not perform smoothing of the t'l-norm, while pdNCG does. 


— FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) [5] is an opti¬ 
mal first-order method for problem ([^, which adheres to the structure of 
GFrame. At a point x, FISTA builds a convex function: 


Qr{y;x) := T\\y\\i + ^\\Ax-b\\^ + {A'^{Ax - b)y {y - x) + ^\\y - x\\l, 

where L is an upper bound of Xmax(A'^A), and solves subproblem ([^ ex¬ 
actly using shringkage-thresholding naiizi. An efficient implementation of 
this algorithm can be found as part of TFOCS (Templates for First-Order 
Conic Solvers) package [6| under the name N83. In this implementation 
the parameter L is calculated dynamically. 

— PCDM (Parallel Coordinate Descent Method) |3S| is a randomized paral¬ 
lel coordinate descent method. The parallel updates are performed asyn¬ 
chronously and the coordinates to be updated are chosen uniformly at 
random. Let w be the number of processors that are employed by PCDM. 
Then, at a point x, PCDM builds w convex approximations: 


Q\{yi\x) := T\y^\ + ^\\Ax - -f {Aj{Ax - b)){yi - xi) + ^{yi - Xif, 
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Vi = 1, 2, • • • , 137, where Ai is the ith column of matrix A and Li = (A'^A)ii 
is the ith diagonal elem ent o f matrix A'^ A and /3 is a positive constant which 
is defined in Subsection 


8.3 


The Q], functions are minimized exactly using 

shrinkage-thresholding. 

PSSgb (Projected Scaled Subgradient, Gafni-Bertsekas variant) [37] is a 
second-order method. At each iteration of PSSgb the coordinates are sep¬ 
arated into two sets, the working set W and the active set A. The working 
set consists of all coordinates for which, the current point x is nonzero. 
The active set is the complement of the working set W. The following local 
quadratic model is build at each iteration 

Qriy.x) := fr{x) + {Vfr{,x)y {y - x) L ]^{y - xYH{y - x), 


where Vfr{x) is a sub-gradient of fr at point x with the minimum Eu¬ 
clidean norm, see Subsection 2.2.1 in m for details. Moreover, matrix H 
is defined as: 


H = 


Hw 0 
0 Ha ’ 


where iJvv is an L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb- 
Shanno) Hessian approximation with respect to the coordinates W and 
Ha is a positive diagonal matrix. The diagonal matrix Ha is a scaled 
identity matrix, where the Shanno-Phua/Barzilai-Borwein scaling is used, 
see Subsection 2.3.1 in m for details. The local model is minimized exactly 
since the inverse of matrix H is known due to properties of the L-BFGS 
Hessian approximation iLvv- 

- pdNGG (primal-dual Newton Gonjugate Gradients) [16] is also a second- 
order method. At every point x pdNCG constructs a convex function Qr 
exactly as described for (§. The subproblem @ is solved inexactly by 
reducing it to the linear system: 


y^fit{x){y-x) = -yfit{x), 


which is solved approximately using preconditioned Gonjugate Gradients 
(PGG). A simple diagonal preconditioner is used for all experiments. The 
preconditioner is the inverse of the diagonal of matrix f!^{x). 


8.2 Implementation details 


Solvers pdNGG, FISTA and PSSgb are implemented in MATLAB, while solver 
PGDM is a C-l—I- implementation. We expect that the programming language 
will not be an obstacle for pdNGG, FISTA and PSSgb. This is because these 
methods rely only on basic linear algebra operations, such as the dot product, 
which are implemented in C-l—h in MATLAB by default. The experiments in 
8 .6] were performed on a Dell PowerEdge R920 running 


Subsections 8.4 8.5 


Redhat Enterprise Linux with four Intel Xeon E7-4830 v2 2.2GHz processors, 
20MB Cache, 7.2 GT/s QPI, Turbo (4xlOCores). 
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The huge scale experiments in Subsection 8.9 were performed on a Cray 
XC30 MPP supercomputer. This work made use of the resources provided by 
ARCHER (http://www.archer.ac.uk/), made available through the Edin¬ 
burgh Compute and Data Eacility (ECDF) (http://www.ecdf.ed.ac.uk/). 
According to the most recent list of commercial supercomputers, which is pub¬ 
lished in TOP500 list (http://www.top500.org), ARCHER is currently the 
25th fastest supercomputer worldwide out of 500 supercomputers. ARCHER 
has a total of 118,080 cores with performance 1,642.54 TFlops/s on LIN- 
PACK benchmark and 2,550.53 TFlops/s theoretical peak perfomance. The 
most computationally demanding experiments which are presented in Subsec¬ 
tion [8]^ required more than half of the cores of ARCHER, i.e., 65,536 cores 
out of 118,080. 


8.3 Parameter tuning 

We describe the most important parameters for each solver, any other parame¬ 
ters are set to their default values. For pdNCG we set the smoothing parameter 
to /r = 10“®, this setting allows accurate solution of the original problem with 
an error of order [3T]. For pdNCG, PCG is terminated when the rela¬ 

tive residual is less that 10“^ and the backtracking line-search is terminated 
if it exceeds 50 iterations. Regarding FISTA the most important parameter 
is the calculation of the Lipschitz constant L, which is handled dynamically 
by TFOCS. For PCDM the coordinate Lipschitz constants Li \/i = 1,2, ■■ ■ ,n 
are calculated exactly and parameter (3 = 1 + {uj — l){ru — l)/{n — 1), where 
Lo changes for every problem since it is the degree of partial separability of 
the fidelity function in which is easily calculated (see [3S]), and ro = 40 is 
the number of cores that are used. For PSSgb we set the number of L-BFGS 
corrections to 10. 

We set the regularization parameter r = 1, unless stated otherwise. We run 
pdNCG for sufficient time such that the problems are adequately solved. Then, 
the rest of the methods are terminated when the objective function fr in ([^ 
is below the one obtained by pdNCG or when a predefined maximum number 
of iterations limit is reached. All comparisons are presented in figures which 
show the progress of the objective function against the wall clock time. This 
way the reader can compare the performance of the solvers for various levels 
of accuracy. We use logarithmic scales for the wall clock time and terminate 
runs which do not converge in about 10® seconds, i.e., approximately 27 hours. 


8.4 Increasing condition number of A'''A 

In this experiment we present the performance of FISTA, PCDM, PSSgb and 
pdNCG for increasing condition number of matrix A'^A when Procedure Os- 
Gen is used to construct the optimal solution x*. We generate six matrices A 
and two instances of x* for every matrix A; twelve instances in total. 
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The singular value decomposition of matrix A is A = T'C', where S is 
the matrix of singular values, the columns of matrices Im and G are the left 
and right singular vectors, respectively, see Subsection |4.1| for details about 
the construction of matrix G. The singular values of matrices A are chosen 
uniformly at random in the intervals [0,10'^], where g = 0,1, • • • ,5, for each of 
the six matrices A. Then, all singular values are shifted by 10“^. The previous 
resulted in a condition number of matrix A'^ A which varies from 10^ to 10^^ 
with a step of times 10^. The rotation angle 9 of matrix G is set to 27r/3 radians. 
Matrices A have n = 2^^ columns, m = 2n rows and rank n. The optimal 
solutions X* have s = nonzero components for all twelve instances. 

For the first set of six instances we set 7 = 10 in OsGen, which resulted 
in ^ 0 . 1 ( 2 ;*) ~ 1 for all experiments. The results are presented in Figure]^ 
For these instances PCDM is clearly the fastest for k{A'^A) < 10"*, while for 
k{A'^A) > 10® pdNCG is the most efficient. 

For the second set of six instances we set 7 = 10® in Procedure OsGen, 
which resulted in the same «;o.i(a;*) as before for every matrix A. The results 
are presented in Figure For these instances PCDM is the fastest for very 
well conditioned problems with k{A'^ A) < 10^, while pdNCG is the fastest for 
k(AtA) > 10^. 

We observed that pdNCG required at most 30 iterations to converge for 
all experiments. For FISTA, PCDM and PSSgb the number of iterations was 
varying between thousands and tens of thousands iterations depending on 
the condition number of matrix A'^ A; the larger the condition number the 
more the iterations. However, the number of iterations is not a fair metric 
to compare solvers because every solver has different computational cost per 
iteration. In particular, FISTA, PCDM and PSSgb perform few inner products 
per iteration, which makes every iteration inexpensive, but the number of 
iterations is sensitive to the condition number of matrix A'^ A. On the other 
hand, for pdNCG the empirical iteration complexity is fairly stable, however, 
the number of inner products per iteration (mainly matrix-vector products 
with matrix A) may increase as the condition number of matrix A’’’A increases. 
Inner products are the major computational burden at every iteration for all 
solvers, therefore, the faster an algorithm converged in terms of wall-clock 
time the less inner products that are calculated. In Figuresandwe display 
the objective evaluation against wall-clock time (log-scale) to facilitate the 
comparison of different algorithms. 


8.5 Increasing condition number of A^A: non-trivial construction of x* 

In this experiment we examine the performance of the methods as the con¬ 
dition number of matrix A'''A increases, while the optimal solution x* is 
generated using Procedure OsGen3 (instead of OsGen) with 7 = 100 and 
Si = S 2 = s/2. Two classes of instances are generated, each class consists 
of four instances (A, a:*) with n = 2^^, m = 2n and s = n/2^. Matrix A is 
constructed as in Subsection |8.4| The singular values of matrices A are cho- 
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Fig. 5: Performance of pdNCG, FISTA, PCDM and PSSgb on synthetic S- 
LS problems for increasing condition number of matrix A and 7 = 10 in 
Procedure OsGen. The axes are in log-scale. In this figure fr denotes the 
objective value that was obtained by each solver. 
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Fig. 6 : Performance of pdNCG, FISTA, PCDM and PSSgb on a synthetic S- 
LS problem for increasing condition number of matrix A'^A and 7 = 10^ in 
Procedure OsGen. The axes are in log-scale. 
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sen uniformly at random in the intervals [0,10'^], where g = 0,1, • • • ,3, for all 
generated matrices A. Then, all singular values are shifted by 10“^. The pre¬ 
vious resulted in a condition number of matrix A'''A which varies from 10^ to 
10® with a step of times 10^. The condition number of the generated optimal 
solutions was on average ^ 0 , 1 ( 2 ;*) « 40. 

The two classes of experiments are distinguished based on the rotation 
angle 0 that is used for the composition of Givens rotations G. In particular, 
for the first class of experiments the angle is 0 = 27r/10 radians, while for 
the second class of experiments the rotation angle is 0 = 27r/10® radians. The 
difference between the two classes is that the second class consists of matrices 
A'^'A for which, a major part of their mass is concentrated in the diagonal. 
This setting is beneficial for PCDM since it uses information only from the 
diagonal of matrices A'''A. This setting is also beneficial for pdNCG since it 
uses a diagonal preconditioner for the inexact solution of linear systems at 
every iteration. 

The results for the first class of experiments are presented in Figure 
For instances with k{A'^A) > 10® PCDM was terminated after 1,000,000 
iterations, which corresponded to more than 27 hours of wall-clock time. 

The results for the second class of experiments are presented in Figure 
Notice in this figure that the objective function is only slightly reduced. This 
does not mean that the initial solution, which was the zero vector, was nearly 
optimal. This is because noise with large norm, i.e., \\Ax* — b\\ is large, was 
used in these experiments, therefore, changes in the optimal solution did not 
have large affect on the objective function. 


8.6 Increasing dimensions 

In this experiment we present the performance of pdNCG, FISTA, PCDM 
and PSSgb as the number of variables n increases. We generate four instances 
where the number of variables n takes values 2 ^®, 2 ^^, 2 ^"^ and 2 ^®, respectively. 
The singular value decomposition of matrix A is A = AC'. The singular values 
in matrix A are chosen uniformly at random in the interval [ 0 , 10 ] and then 
are shifted by 10“^, which resulted in k(A‘''A) « 10"^. The rotation angle 0 of 
matrix G is set to 27r/10 radians. Moreover, matrices A have m = 2n rows 
and rank n. The optimal solutions x* have s = njl!^ nonzero components for 
each generated instance. For the construction of the optimal solutions x* we 
use Procedure OsGenS with 7 = 100 and s\ = S 2 = which resulted in 
^ 0 . 1 ( 2 ;*) « 3 on average. 

The results of this experiment are presented in Figure Notice that all 
methods have a linear-like scaling with respect to the size of the problem. 


8.7 Increasing density of matrix A‘''A 

In this experiment we demonstrate the performance of pdNCG, FISTA, PCDM 
and PSSgb as the density of matrix A'''A increases. We generate four instances 
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Fig. 7: Performance of pdNCG, FISTA, PCDM and PSSgb on synthetic S- 
LS problems for increasing condition number of matrix A'''A. The optimal 
solutions have been generated using Procedure OsGenS with 7 = 100 and 
Si = S 2 = s/2. The axes are in log-scale. The rotation angle 6* in G was 27r/10. 
For condition number k{A'^A) > 10® PGDM was terminated after 1,000,000 
iterations, which corresponded to more than 27 hours of wall-clock time. 


(A,a;*). For the first experiment we generate matrix A = AC', where A 
is the matrix of singular values, the columns of matrices I^n and G are the 
left and right singular vectors, respectively. For the second experiment we 
generate matrix A = S{G 2 Gy, where the columns of matrices Im and G 2 G 
are the left and right singular vectors of matrix A, respectively; G 2 has been 
defined in Subsection |4.2[ Finally, for the third and fourth experiments we have 
A = S(GG 2 Gy and A = A(G 2 GG 2 G)''', respectively. For each experiment 
the singular values of matrix A are chosen uniformly at random in the interval 
[0,10] and then are shifted by 10“^, which resulted in k{A'^A) « 10^. The 
rotation angle 9 of matrices G and G 2 is set to 27r/10 radians. Matrices A have 
m = 2n rows, rank n and n = 2^^. The optimal solutions x* have s = n/2^ 
nonzero components for each experiment. Moreover, Procedure OsGenS is used 
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(a) k(AtA) = 10^ (b) K(ATy4) = lO'^ 
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Fig. 8: Performance of pdNCG, FISTA, PCDM and PSSgb on synthetic S- 
LS problems for increasing condition number of matrix A'''A. The optimal 
solutions have been generated by using Procedure OsGenS with 7 = 100 and 
Si = S2 = s/2. The rotation angle d in G was 27r/10^. The axes are in log-scale. 


with 7 = 100 and Si = S2 = s/2 for the construction of x* for each experiment, 
which resulted in « 2 on average. 

The results of this experiment are presented in FigureObserve, that all 
methods had a robust performance with respect to the density of matrix A. 


8.8 Varying parameter r 

In this experiment we present the performance of pdNGG, FISTA, PGDM and 
PSSgb as parameter r varies from 10“'^ to 10^ with a step of times 10^. We 
generate four instances (A,a;*), where matrix A = AC' has m = 2n rows, 
rank n and n = 2^^. The singular values of matrices A are chosen uniformly 
at random in the interval [0,10] and then are shifted by 10“^, which resulted 
in k{A'^A) « 10^ for each experiment. The rotation angles 9 for matrix G 
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wall clock time (seconds) 



10"' 10“ 10' 10^ 10^ 

wall clock time (seconds) 


(a) n = 2^0 


(b) n = 2^2 




wall clock time (seconds) wall clock time (seconds) 

(c) n = 2^4 (d) n = 226 

Fig. 9: Performance of pdNCG, FISTA, PCDM and PSSgb on a synthetic S-LS 
problem for increasing number of variables n. The axes are in log-scale. 


in A is set to 27r/10 radians. The optimal solution x* has s = n/2^ nonzero 
components for all instances. Moreover, the optimal solutions are generated 
using Procedure OsGenS with 7 = 100, which resulted in Ko.i(a:*) « 3 for all 
four instances. 


The performance of the methods is presented in Figure [TT| Notice in Sub¬ 
figure [T^ that for pdNGG the objective function f^. is not always decreasing 
monotonically. A possible explanation is that the backtracking line-search of 
pdNGG, which guarantees monotonic decrease of the objective function m, 
terminates in case that 50 backtracking iterations are exceeded, regardless if 
certain termination criteria are satisfied. 
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Fig. 10: Performance of pdNCG, FISTA, PCDM and PSSgb on synthetic S- 
LS problems for increasing number of nonzeros of matrix A. The axes are in 
log-scale. 


8.9 Performance of a second-order method on huge scale problems 

We now present the performance of pdNCG on synthetic huge scale (up to one 
trillion variables) S-LS problems as the number of variables and the number 
of processors increase. 

We generate six instances (A,a:*), where the number of variables n takes 
values 2^^, 2^"^, 2^®, 2^® and 2^^. Matrices A = have m = 2n rows and 
rank n. The singular values ai for i = 1, 2, • • • , n of matrices A are set to 10“^ 
for odd Vs and 10^ for even Vs. The rotation angle 9 of matrix G is set to 27r/3 
radians. The optimal solutions x* have s = n/2^® nonzero components for 
each experiment. In order to simplify the practical generation of this problem 
the optimal solutions x* are set to have s/2 components equal to —10^ and 
the rest of nonzero components are set equal to 10“^. 

Details of the performance of pdNCG are given in Table Observe the 
nearly linear scaling of pdNCG with respect to the number of variables n and 
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Fig. 11: Performance of pdNCG, FISTA, PCDM and PSSgb on synthetic S- 
LS problems for various values of parameter r. The axes are in log-scale. 
Observe in Subfigure |lld| that for pdNCG the objective function fr is not 
always decreasing monotonically. This is due to the backtracking line-search of 
pdNCG, which terminates in case that the maximum number of backtracking 
iterations is exceeded regardless if certain termination criteria are satisfied. 


the number of processors. For all experiments in Table pdNCG required 8 
Newton steps to converge, 100 PCG iterations per Newton step on average, 
where every PCG iteration requires two matrix-vector products with matrix 

A. 


9 Conclusion 

In this paper we developed an instance generator for -regularized sparse 
least-squares problems. The generator is aimed for the construction of very 
large-scale instances. Therefore it scales well as the number of variables in¬ 
creases, both in terms of memory requirements and time. Additionally, the 
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n 

Processors 

Memory (terabytes) 

Time (seconds) 

230 

64 

0.192 

1,923 

232 

256 

0.768 

1,968 

234 

1024 

3.072 

1,986 

236 

4096 

12.288 

1,970 

to 

CO 

00 

16384 

49.152 

1,990 

240 

65536 

196.608 

2,006 


Table 1: Performance of pdNCG for synthetic huge scale S-LS problems. All 
problems have been solved to a relative error of order 10“^ of the obtained 
solution 


generator allows control of the conditioning and the sparsity of the problem. 
Examples are provided on how to exploit the previous advantages of the pro¬ 
posed generator. We believe that the optimization community needs such a 
generator to be able to perform fair assessment of new algorithms. 

Using the proposed generator we constructed very large-scale sparse in¬ 
stances (up to one trillion variables), which vary from very well-conditioned to 
moderately ill-conditioned. We examined the performance of several represen¬ 
tative first- and second-order optimization methods. The experiments revealed 
that regardless of the size of the problem, the performance of the methods cru¬ 
cially depends on the conditioning of the problem. In particular, the first-order 
methods PCDM and FISTA are faster for problems with small or moderate 
condition number, whilst, the second-order method pdNCG is much more ef¬ 
ficient for ill-conditioned problems. 
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